-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A fast netcdf output writer #433
Conversation
- Writes a geometry file - Makes provisions for slicing field arrays and associated dimesions
- Implements a fully functional netcdf output writer with slicing - Example demonstrating usage
Codecov Report
@@ Coverage Diff @@
## master #433 +/- ##
==========================================
+ Coverage 73.22% 73.34% +0.11%
==========================================
Files 27 27
Lines 1505 1508 +3
==========================================
+ Hits 1102 1106 +4
+ Misses 403 402 -1
Continue to review full report at Codecov.
|
I think that if we want to merge this we should nuke the other output writer based on NetCDF and call this one the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts:
-
Let's have just one netcdf-based output writer. Nuke the other one if this one is better.
-
Let's depend on just one NetCDF package. If
NCDatasets.jl
is preferred, then lets remove the dependency onNetCDF.jl
. -
This PR breaks the pattern with other output writers by creating a new file for just one output writer. That's fine but we should be consistent. I suggest that we put this new output writer in
output_writers.jl
, and address the issue of file structure in a subsequent PR. -
Functionality that can only conceivably be used with output writing (functions that return string labels like
xdim
, for example) should probably be defined near the output writer. -
The example should be a lot simpler and ideally will either be a bare-bones introduction of output writing or will introduce other new concepts in addition to output writing (much of the example is a duplicate of another example). If the new example is intended only to introduce how to write output, I think just initializing some barebones model (using mostly defaults in the
Model
constructor, with a random velocity field, is sufficient. It should also run for some small number of time steps since this is all that's needed to demonstrate output writing. Otherwise, the example could be expanded to demonstrate new concepts. However, the suite of examples is currently a work in progress in PR Examples: more, better, tested #425 (and other ongoing work). -
There's some confusion about the function (constructor?)
WriteGeometry
that needs to be cleared up.
src/fields.jl
Outdated
xdim(LX), ydim(LY), zdim(LZ) | ||
end | ||
|
||
xdim(::Type{Face}) = "xF" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these functions called "xdim
"? I also don't think these belong in fields.jl
because they don't really provide functionality for fields (they will only be used for labeling / output writing?). Perhaps move them to the output writers file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's supposed to return the x dimension of a field (either xC or xF). Similarly for ydim and zdim. Should we rename them to something else? I'll move them to the output_writers file in the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you refer to xC
as the 'dimension' of a field? Maybe this makes sense, I just don't quite understand. Perhaps this is the 'location' of a field?
Note there is a function location(f)
which returns a tuple of types (eg (Face, Cell, Cell)
) corresponding to x, y, z
"location" of a field. This function seems to have a similar function for the x
, y
, and z
directions, except that it translates the types to a string label. Would a better name be x_location_label(Lx)
, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you refer to xC as the 'dimension' of a field? Maybe this makes sense, I just don't quite understand. Perhaps this is the 'location' of a field?
Think of dimension like an axis here. I call it dimension because that's the convention followed by the netcdf metadata. There's a list of dimensions followed by a list of variables. Each variable, unless it's a scalar, should have at least one dimension from the list of dimensions. See the following netcdf file produced by this outputwriter for an example.
$ ncdump -h dump.nc
netcdf dump {
dimensions:
zC = 14 ;
zF = 14 ;
xC = 14 ;
yF = 14 ;
xF = 14 ;
yC = 14 ;
Time = UNLIMITED ; // (370 currently)
variables:
double zC(zC) ;
double zF(zF) ;
double xC(xC) ;
double yF(yF) ;
double xF(xF) ;
double yC(yC) ;
float Time(Time) ;
float v(Time, zC, yF, xC) ;
float S(Time, zC, yC, xC) ;
float w(Time, zF, yC, xC) ;
float T(Time, zC, yC, xC) ;
float u(Time, zC, yC, xF) ;
}
This format is powerful because there are many tools built for working with it e.g. python's xarray
.
Note there is a function location(f) which returns a tuple of types (eg (Face, Cell, Cell)) corresponding to x, y, z "location" of a field. This function seems to have a similar function for the x, y, and z directions, except that it translates the types to a string label. Would a better name be x_location_label(Lx), etc?
I just want a function that gives ("xF", "yC", "zC")
for u
and so on. Is there a more efficient way to get it from location
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thank you, I think I vaguely understand now (I've worked with NetCDF data before, but its been some time). In summary: the idea is that there is an array called xC
(for example) whose values correspond to the coordinates along the "dimension" that is labeled xC
of any field in the dataset?
That explains why the "dimension" xC
is distinct from the "dimension" xF
in NetCDF output. In the ordinary usage of the word "dimension", however, this distinction is confusing. The dimension in ordinary terminology is simply "x".
This terminology seems quite specific to the NetCDF output, and doesn't really correspond to anything else in Oceananigans. I think it's probably appropriate to put these functions in output_writers.jl
and give it some name like netcdf_spatial_dimensions(f)
.
As for your question about efficiency / code length --- I think you could get this functionality with
abbrev(L) = string(L)[1]
netcdf_spatial_dimemsions(f::Field{X, Y, Z}) where {X, Y, Z} =
("x" * abbrev(X), "y" * abbrev(Y), "z" * abbrev(Z))
I get
julia> using Oceananigans
julia> using Oceananigans: Face, Cell
julia> grid = RegularCartesianGrid((3, 3, 3), (1, 1, 1));
julia> a = Field(Face, Cell, Face, CPU(), grid);
julia> netcdf_spatial_dimemsions(a)
("xF", "yC", "zF")
maybe you can come up with a better name for abbrev
(though it's kind of cute as is).
src/ncoutputwriter.jl
Outdated
@@ -0,0 +1,150 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a constructor for a struct called WriteGeometry
? I think it is not, in which case it should use snake_case
.
In the rest of the code, I think we use the word grid
to refer to the information that is saved by this function. We haven't introduced the concept of geometry
--- I'm not sure this is the right place. What do you think?
Why can't the grid information be written to the file saved by the output writer (why does it have to be written to a separate file?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a constructor for a struct called WriteGeometry? I think it is not, in which case it should use snake_case.
It is a plain function. I'll convert to snake case in the next commit.
In the rest of the code, I think we use the word grid to refer to the information that is saved by this function. We haven't introduced the concept of geometry --- I'm not sure this is the right place. What do you think?
We could call it write_grid
instead of write_geometry
. I used geometry because MOM6 called it that.
Why can't the grid information be written to the file saved by the output writer (why does it have to be written to a separate file?)
The outputwriter could be used to write global or sliced output. So if someone writes only sliced output, the grid in that file will also be sliced. In that case it will be useful to provide functionality to write the entire grid to an independent file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could call it write_grid instead of write_geometry. I used geometry because MOM6 called it that.
I feel it makes sense to use grid
here since we use it elsewhere right now so that the name of the function is self-explanatory. If we think it makes sense to use geometry
rather than grid
throughout the code, we can perhaps discuss that in a future PR or issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outputwriter could be used to write global or sliced output. So if someone writes only sliced output, the grid in that file will also be sliced.
Okay. Presumably this doesn't matter for grids with separable dimensions (like RegularCartesianGrid
, but could matter for grids with curvature?
I don't quite see why it helps to "slice" the grid for sliced output (is that just a feature to save memory)? Maybe you can explain why it's important to save subsets of the grid for certain purposes? Why not save the whole grid (and record the location of the sliced output within the grid)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel it makes sense to use grid here since we use it elsewhere right now so that the name of the function is self-explanatory. If we think it makes sense to use geometry rather than grid throughout the code, we can perhaps discuss that in a future PR or issue?
No, I think grid is fine. It conveys what it is. I changed geometry to grid.
Okay. Presumably this doesn't matter for grids with separable dimensions (like RegularCartesianGrid, but could matter for grids with curvature?
If the curved grid is a lat-lon grid, we can just store latC, lonC, latF, lonF, etc.
A useful feature would be outputting dx and dy for all grids. So for implementing, say, a differential operator for postprocessing end users don't have to worry about the nature of the grid.
I don't quite see why it helps to "slice" the grid for sliced output (is that just a feature to save memory)? Maybe you can explain why it's important to save subsets of the grid for certain purposes? Why not save the whole grid (and record the location of the sliced output within the grid)?
If you refer to my previous reply (where an ncdump is shown), a variable, say, u
has dimensions (Time,zC,yC,xF)
. Netcdf format imposes a condition that size(u,1) == length(Time)
, size(u,2) == length(zC)
, and so on. Tools like xarray
leverage this property quite effectively. For example, slicing and plotting is a one liner in xarray:
import xarray as xr
with xr.open_dataset('dump.nc') as ds
ds.u.sel(Time=1,xF=10).plot()
That's it. This plots a yz section at x=10, Time=1. It figures out axes, labels them appropriately, etc in the background.
If you don't slice the grid along with the varibale, dimension sizes won't align with the variable sizes and netcdf would just error out. You could just have integer indices as netcdf dimensions but then we lose all the benefits of the format. Does this help?
It would be nice to eventually have jld2 output writer do this too because then we might be able to write a jld2 to xarray interpreter. Xarray is, by far, the best tool right now to deal with multidimensional arrays with labeled axes. We could potentially attract more users if we start supporting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I understand. I think I see how what you describe is a simple solution to the problem of describing subsets of data within a grid --- though in some ways this solution is necessary because NetCDF doesn't have an abstraction for these subsets of data... ?
It would seem that the need to save the entire grid separately from subsets / slices of the grid within a particular NetCDF file is actually some kind of limitation, rather than a feature (presumably we would rather not have to do this).
Either way, I agree it's good to have functionality that facilitates usage by popular packages. Since I don't completely understand the issue, if you say its necessary then I'm fine to keep this functionality to facilitate such a usage pattern.
As for facilitation of interop between JLD2
and xarray
, would it make sense to write a wholly separate output writer devoted to outputting fields and subsets of fields in JLD2
format that can be read by xarray
? As it is the JLD2OutputWriter
is quite simple and therefore flexible (arbitrary data can be saved, including the result of complicated on-line calculations, arbitrary julia types, etc).
As an aside, it may also be interesting to take a look at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I understand. I think I see how what you describe is a simple solution to the problem of describing subsets of data within a grid --- though in some ways this solution is necessary because NetCDF doesn't have an abstraction for these subsets of data... ?
Exactly. Tools like xarray provide this abstraction on top of netcdf. I presume most users would want to store global datasets and then take slices using their tool of choice. This netcdfwriter stores global datasets if slices are not provided (well not quite, we need to discuss halos). Storing slices directly from the model makes sense only if someone is constrained by storage rather than computational resources.
It would seem that the need to save the entire grid separately from subsets / slices of the grid within a particular NetCDF file is actually some kind of limitation, rather than a feature (presumably we would rather not have to do this).
Yes, but we write the global grid only once at the beginning of the run, so it's not expensive. Also, for sliced output, slices are computed when we initialize the outputwriter. So no new computation is carried out at every output iteration. We do use getindex
, which creates copies of arrays, to get slices of fieldarrays at every output iteration. We could replace it with view
if memory becomes a constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for facilitation of interop between JLD2 and xarray, would it make sense to write a wholly separate output writer devoted to outputting fields and subsets of fields in JLD2 format that can be read by xarray? As it is the JLD2OutputWriter is quite simple and therefore flexible (arbitrary data can be saved, including the result of complicated on-line calculations, arbitrary julia types, etc).
I don't understand the JLD2 file type well enough to make a call on this. @ali-ramadhan what do you think.
Netcdf does have support for scalars and attributes. So may be we can store some arbitrary data in netcdf as well. Will look into it eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Netcdf does have support for scalars and attributes. So may be we can store some arbitrary data in netcdf as well.
NetCDF cannot store arbitrary type information, however, which means that it cannot store julia types (like a BoundaryCondition
). In JLD2
a BoundaryCondition
can be saved directly to disk and then loaded back without shenanigans or the need to manually reconstruct the julia type.
Yes, but we write the global grid only once at the beginning of the run, so it's not expensive.
I'm referring to the annoyance of having to manually save the grid in a separate file by calling write_grid
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referring to the annoyance of having to manually save the grid in a separate file by calling write_grid.
If the users write out a global output file it won't be necessary to call write_grid
.
NetCDF cannot store arbitrary type information, however, which means that it cannot store julia types (like a BoundaryCondition). In JLD2 a BoundaryCondition can be saved directly to disk and then loaded back without shenanigans or the need to manually reconstruct the julia type.
I understand.
Hi @glwagner Just ran into this and wanted to mention NCTiles.jl which we, https://github.com/lmilechin and myself, recently released. Not sure if |
- nukes old NetCDFOutputwriter - renames geometry to grid
…s on global grids
- Better specification of slices - Can now deal with integers instead of slices
- Should fix the failing GPU tests
Hi @gaelforget , Thank you for bringing this to our notice! From what I understand, NCTiles provides convenient functions to either use NCDatasets.jl or NetCDF.jl and integrates with MeshArrays.jl, which has representations for tri-polar grids and such. I haven't had the time to read your 2015 paper so please correct me if I did not grasp the full extent of NCTiles' capabilities. So far Oceananigans does not have support for anything other than regular cartesian grids so NCTiles might be an overkill. But we can definitely consider it in the future when Oceananigans gains new capabilities. By the way, did you notice any differences in performance between NCDatasets.jl and Netcdf.jl? |
This pull request is now ripe for merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Can't wait to switch to NetCDF.
@suyashbire1 do you want to remove the NetCDF.jl dependency? Can also remove [WIP]
from the PR title.
PS: Sorry for the slow review. Hopefully it's just tiny changes.
OWClose(subsetwriter) | ||
OWClose(globalwriter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now but I wonder if there's a way to automatically close NetCDF files so the user doesn't have to worry about this stuff.
Unfortunately Julia doesn't have destructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh nevermind, Julia does have finalizers:
finalizer(f, x)
Register a function f(x) to be called when there are no program-accessible
references to x, and return x. The type of x must be a mutable struct,
otherwise the behavior of this function is unpredictable.
but not sure if this would be a proper use of them. See JuliaLang/julia#11207
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function should also be changed to snake_case
? Let's make sure we use the proper style for all new functions in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what finalizers/destructors do. Sounds like garbage collection.
Anyway, we already have a list of outputwriters. Could we iterate over it and have close(outputwriter)
in a try catch
block at the end of the run (may be at the end of the proposed function that will replace the time-step loop)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function should also be changed to snake_case? Let's make sure we use the proper style for all new functions in this PR.
Done.
"xC" => ["longname" => "Locations of the cell centers in the x-direction.", "units" => "m"], | ||
"yC" => ["longname" => "Locations of the cell centers in the y-direction.", "units" => "m"], | ||
"zC" => ["longname" => "Locations of the cell centers in the z-direction.", "units" => "m"], | ||
"xF" => ["longname" => "Locations of the cell faces in the x-direction.", "units" => "m"], | ||
"yF" => ["longname" => "Locations of the cell faces in the y-direction.", "units" => "m"], | ||
"zF" => ["longname" => "Locations of the cell faces in the z-direction.", "units" => "m"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, "units" => "m"
is true by default, but will be wrong for NonDimensionalModel
. Don't think we have to worry about that in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit confusing. Technically speaking there are no assumptions about units in the code. The user is free to pick their parameters with the desired units and as long as everything the user does is consistent, the results will be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be difficult to add a units property to all fields and grid axes? That way we could just call it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be difficult. Alternatively, we can look into using Unitful
, which might be more natural / less hack-y. Is it necessary to have units? I think it could be best to leave the units unspecified, and create an issue about adding units to the model, since this is a larger problem than should be addressed in this PR probably.
src/output_writers.jl
Outdated
frequency :: Union{Nothing, Int} | ||
clobber :: Bool | ||
slices :: Dict | ||
nt :: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to explicitly write out what nt
is? I'm actually not sure what it stands for lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I should have made it clear. It's the length of the time dimension. We keep appending data along the time dimension so this variable is incremented by 1 at each output iteration. It has been renamed to len_time_dimension
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
Pinging @glwagner for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to merge, however I have a few questions / comments that could be implemented here, or added to a new issue for the future:
-
Can users save only some of the model fields, rather than having to save all of them? This can be done similar to how we handle multiple outputs in the
JLD2OutputWriter
using either aDict
orNamedTuple
with(name, field)
pairs (or(name, FieldOutput)
pairs; see 3 below). -
Can we add "units" to the output writer? This is probably an easier option than adding units to
grid
or each field. -
Going along with 1 and 2, probably what we really want to do is to extend the
FieldOutput
type to be something like
struct FieldOutput{F, RT, D, U, S}
field :: F
return_type :: RT
longname :: D
units :: U
slice :: S
end
This will allow the user to specify the longname
and units
for each field, and also to specify slices of each field individually.
We can additionally write constructors so that if the user specifies something like
fields = FieldOutputs(model.velocities)
then the user gets default longname
and units
for the velocity field. Ditto for model.diffusivities
and model.tracers
, though in the case of tracers we will have to support defaults for a number of tracer names (and its a bit wonky because we have to hope that the user really does mean "salinity" when their tracer is named :S
, for example). Then the user can use a tuple of FieldOutput
s to specify what they want to some out of NetCDF
. We can also make the default fields be the merger of model.velocities
and model.tracers
.
"xC" => ["longname" => "Locations of the cell centers in the x-direction.", "units" => "m"], | ||
"yC" => ["longname" => "Locations of the cell centers in the y-direction.", "units" => "m"], | ||
"zC" => ["longname" => "Locations of the cell centers in the z-direction.", "units" => "m"], | ||
"xF" => ["longname" => "Locations of the cell faces in the x-direction.", "units" => "m"], | ||
"yF" => ["longname" => "Locations of the cell faces in the y-direction.", "units" => "m"], | ||
"zF" => ["longname" => "Locations of the cell faces in the z-direction.", "units" => "m"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be difficult. Alternatively, we can look into using Unitful
, which might be more natural / less hack-y. Is it necessary to have units? I think it could be best to leave the units unspecified, and create an issue about adding units to the model, since this is a larger problem than should be addressed in this PR probably.
The |
"yF" => collect(model.grid.yF)[1:end-1], | ||
"zF" => collect(model.grid.zF)[1:end-1] | ||
) | ||
dim_attrib = Dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe add a keyword argument grid_units
with a default of meters
? That way the user can specify correct units if they need to. It's an incomplete solution but at least provides the possibility of correct behavior.
I'd love to start using NetCDF for some simulations I'm running this week. @suyashbire1 maybe we can find some time to meet and polish off this PR and merge it? Could also do the same for PR #438. |
…assoc_dimensions A fast netcdf output writer Former-commit-id: 5114a22
TODO:
Edit: